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ABSTRACT 

Over the past five years, graphics processing units (GPUs) 
have had a transformational effect on numerical lattice quan- 
tum chromodynamics (LQCD) calculations in nuclear and 
particle physics. While GPUs have been applied with great 
success to the post-Monte Carlo "analysis" phase which ac- 
counts for a substantial fraction of the workload in a typi- 
cal LQCD calculation, the initial Monte Carlo "gauge field 
generation" phase requires capability-level supercomputing, 
corresponding to O(f00) GPUs or more. Such strong scaling 
has not been previously achieved. In this contribution, we 
demonstrate that using a multi-dimensional parallelization 
strategy and a domain-decomposed preconditioner allows us 
to scale into this regime. We present results for two popular 
discretizations of the Dirac operator, Wilson-clover and im- 
proved staggered, employing up to 256 GPUs on the Edge 
cluster at Lawrence Livermore National Laboratory. 

Categories and Subject Descriptors 

G.1.3 [Numerical Linear Algebra]: Linear systems (di- 
rect and iterative methods; G.1.8 [Partial Differential 
Equations]: Domain decomposition methods, Finite differ- 
ence methods, Iterative solution techniques; J. 2 [Physical 
Sciences and Engineering]: Physics 

Keywords 

Lattice QCD, GPU, Krylov solvers, domain decomposition 



*These authors contributed equally to this work. 



Permission to make digital or hard copies of all or part of this work for 
personal or classroom use is granted without fee provided that copies are 
not made or distributed for profit or commercial advantage and that copies 
bear this notice and the full citation on the first page. To copy otherwise, to 
republish, to post on servers or to redistribute to lists, requires prior specific 
permission and/or a fee. 

SC11 November 12-18, 2011, Seattle, Washington, USA 
Copyright 2011 ACM 978-1-4503-0771-0/1 1/1 1 ...$10.00. 



1. INTRODUCTION 

Lattice QCD (LQCD) is one of the original computational 
grand challenges jr . Increasingly accurate numerical solu- 
tions of this quantum field theory are being used in tandem 
with experiment and observation to gain a deeper quanti- 
tative understanding for a range of phenomena in nuclear 
and high energy physics. Advances during the last quar- 
ter century required prodigious computational power, the 
development of sophisticated algorithms, and highly opti- 
mized software. As a consequence LQCD is one of the driver 
applications that have stimulated the evolution of new archi- 
tectures such as the BlueGene series [2]. Graphics process- 
ing unit (GPU) clusters challenge us to adapt lattice field 
theory software and algorithms to exploit this potentially 
transformative technology. Here we present methods allow- 
ing QCD linear solvers to scale to hundreds of GPUs with 
high efficiency. The resulting multi-teraflop performance is 
now comparable to typical QCD codes running on capabil- 
ity machines such as the Cray and the BlueGene/P using 
several thousand cores. 

GPU computing has enjoyed a rapid growth in popular- 
ity in recent years due to the impressive performance to 
price ratio offered by the hardware and the availability of 
free software development tools and example codes. Cur- 
rently, the fastest supercomputer in the world, Tianhe-f A, 
is a CPU-based system, and several large-scale GPU systems 
are either under consideration or are in active development. 
Examples include the Titan system proposed for the Oak 
Ridge Leadership Computing Facility (OLCF) and the NSF 
Track 2 Keeneland system to be housed at the National In- 
stitute for Computational Sciences (NICS). 

Such systems represent a larger trend toward heteroge- 
neous architectures, characterized not only by multiple pro- 
cessor types (GPU and conventional CPU) with very dif- 
ferent capabilities, but also by a deep memory hierarchy 
exhibiting a large range of relevant bandwidths and laten- 
cies. These features are expected to typify at least one path 
(or "swim-lane") toward exascale computing. The intrin- 
sic imbalance between different subsystems can present bot- 
tlenecks and a real challenge to application developers. In 
particular, the PCI-E interface currently used to connect 



CPU, GPU, and communications fabric on commodity clus- 
ters can prove a severe impediment for strong-scaling the 
performance of closely coupled, nearest-neighbor stencil-like 
codes into the capability computing regime. Overcoming 
such limitations is vital for applications to succeed on fu- 
ture large-scale heterogeneous resources. 

We consider the challenge of scaling LQCD codes to a 
large number of GPUs. LQCD is important in high energy 
and nuclear physics as it is the only currently known first- 
principles non-perturbative approach for calculations involv- 
ing the strong force. Not only is LQCD important from the 
point of view of physics research, but historically LQCD 
codes have often been used to benchmark and test large 
scale computers. The balanced nature of QCD calculations, 
which require approximately 1 byte/flop in single precision, 
as well as their regular memory access and nearest neighbor 
communication patterns have meant that LQCD codes could 
be deployed quickly, scaled to large partitions, and used to 
exercise the CPU, memory system, and communications fab- 
ric. In GPU computing, LQCD has been highly successful 
in using various forms of data compression and mixed preci- 
sion solvers [3] to alleviate memory bandwidth contraints on 
the GPU in order to attain high performance. Our multi- 
GPU codes [4j [5] |6] are in production, performing capacity 
analysis calculations on systems at several facilities, includ- 
ing Lincoln and EcoG (NCSA), Longhorn (TACC), the "9g" 
and "lOg" clusters at Jefferson Laboratory (JLab), and Edge 
(LLNL). 

The challenge is now to scale LQCD computations into the 
0(100) GPU regime, which is required if large GPU systems 
are to replace the more traditional massively parallel multi- 
core supercomputers that are used for the gauge generation 
step of LQCD. Here, capability-class computing is required 
since the algorithms employed consist of single streams of 
Monte Carlo Markov chains, and so require strong scaling. 

Our previous multi-GPU implementations utilized a strat- 
egy of parallelizing in the time direction only, using tradi- 
tional iterative Krylov solvers (e.g., conjugate gradients). 
This severely limits the number of GPUs that can be used 
and thus maximum performance. In order to make head- 
way, it has become important to parallelize in additional di- 
mensions in order to give sublattices with improved surface- 
to-volume ratios and to explore algorithms that reduce the 
amount of communication altogether such as domain decom- 
posed approaches. 

In this paper, we make the following contributions: (i) 
we parallelize the application of the discretized Dirac op- 
erator in the QUDA library to communicate in multiple 
dimensions, (ii) we investigate the utility of an additive 
Schwarz domain-decomposed preconditioner for the Gener- 
alized Conjugate Residual (GCR) solver, and (iii) we per- 
form performance tests using (i) and (ii) on partitions of up 
to 256 GPUs. The paper is organized as follows: we present 
an outline of LQCD computations in Sec. [2] and discuss 
iterative linear solvers in Sec. [3] A brief overview of previ- 
ous and related work is given in Sec. [4] and in Sec. [5] we 
describe the QUDA library upon which this work is based. 
The implementation of the multi-dimensional, multi-GPU 
parallelization is discussed in Sec. [6] The construction of 
our optimized linear solvers are elaborated in Sec. [§] We 
present performance results in Sec. [9] Finally, we summa- 
rize and conclude in Sec. [To] 



2. LATTICE QCD 

Weakly coupled field theories such as quantum electrody- 
namics can by handled with perturbation theory. In QCD, 
however, at low energies perturbative expansions fail and a 
non-perturbative method is required. Lattice QCD is the 
only known, model independent, non-perturbative tool cur- 
rently available to perform QCD calculations. 

LQCD calculations are typically Monte-Carlo evaluations 
of a Euclidean-time path integral. A sequence of configu- 
rations of the gauge fields is generated in a process known 
as configuration generation. The gauge configurations are 
importance-sampled with respect to the lattice action and 
represent a snapshot of the QCD vacuum. Configuration 
generation is inherently sequential as one configuration is 
generated from the previous one using a stochastic evolu- 
tion process. Many variables can be updated in parallel and 
the focused power of capability computing systems has been 
essential. Once the field configurations have been generated, 
one moves on to the second stage of the calculation, known 
as analysis. In this phase, observables of interest are eval- 
uated on the gauge configurations in the ensemble, and the 
results are then averaged appropriately, to form ensemble 
averaged quantities. It is from the latter that physical re- 
sults such as particle energy spectra can be extracted. The 
analysis phase can be task parallelized over the available 
configurations in an ensemble and is thus extremely suitable 
for capacity level work on clusters, or smaller partitions of 
supercomputers. 



2.1 Dirac PDE discretization 

The fundamental interactions of QCD, those taking place 
between quarks and gluons, are encoded in the quark-gluon 
interaction differential operator known as the Dirac opera- 
tor. A proper discretization of the Dirac operator for lattice 
QCD requires special care. As is common in PDE solvers, 
the derivatives are replaced by finite differences. Thus on 
the lattice, the Dirac operator becomes a large sparse ma- 
trix, M, and the calculation of quark physics is essentially 
reduced to many solutions to systems of linear equations 
given by 

Mx = b. (1) 

Computationally, the brunt of the numerical work in LQCD 
for both the gauge generation and analysis phases involves 
solving such linear sytems. 

A small handful of discretizations are in common use, dif- 
fering in their theoretical properties. Here we focus on two 
of the most widely-used forms for M, the Sheikholeslami- 
Wohlert FF] form (colloquially known as Wilson-clover), and 
the improved staggered form, specifically the a 2 tadpole- 
improved (asqtad) formulation [8]. 



2.2 Wilson-clover matrix 

The Wilson-clover matrix is a central-difference discretiza- 
tion of the Dirac operator, with the addition of a diagonally- 
scaled Laplacian to remove the infamous fermion doublers 
(which arise due to the red-black instability of the central- 
difference approximation). When acting in a vector space 
that is the tensor product of a 4-dimensional discretized Eu- 



clidean spacetime, spin space, and color space it is given by 

M=l 

+ (4 + m + A x )8 x y 
= ~D%£ + (4 + m + A x )5 x , x ,. (2) 

Here is the Kronecker delta; P ±A1 are 4x4 matrix pro- 
jectors in spin space; U is the QCD gauge field which is a 
field of special unitary 3x3 (i.e., SU(3)) matrices acting in 
color space that live between the spacetime sites (and hence 
are referred to as link matrices); A x is the 12 x 12 clover ma- 
trix field acting in both spin and color spaced corresponding 
to a first order discretization correction; and m is the quark 
mass parameter. The indices x and x' are spacetime indices 
(the spin and color indices have been suppressed for brevity) . 
This matrix acts on a vector consisting of a complex-valued 
12-component color-spvnor (or just spinor) for each point 
in spacetime. We refer to the complete lattice vector as a 
spinor field. 
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Figure 1: The nearest neighbor stencil part of the 
lattice Dirac operator D, as denned in ([2]), in the 
/i — v plane. The color-spinor fields are located on 
the sites. The SU(3) color matrices U x are associ- 
ated with the links. The nearest neighbor nature of 
the stencil suggests a natural even-odd (red-black) 
coloring for the sites. 

2.3 Improved staggered matrix 

The staggered matrix again is a central-difference dis- 
cretization of the Dirac operator; however, the fermion dou- 
blers are removed through "staggering" the spin degrees of 
freedom onto neighboring lattice sites. This essentially re- 
duces the number of spin degrees of freedom per site from 
four to one, which reduces the computational burden signifi- 
cantly. This transformation, however, comes at the expense 
of increased discretization errors, and breaks the so-called 
quark-flavor symmtery. To reduce these discretization er- 
rors, the gauge field that connects nearest neighboring sites 
on the lattice (U x in Equation [5| is smeared, which essen- 

1 Each clover matrix has a Hermitian block diagonal, anti- 
Hermitian block off-diagonal structure, and can be fully de- 
scribed by 72 real numbers. 



tially is a local averaging of the field. There are many pre- 
scriptions for this averaging; and here we employ the pop- 
ular asqtad procedure [8]. The errors are further reduced 
through the inclusion of the third neighboring spinors in the 
derivative approximaton. The asqtad matrix is given by 

i 4 

M =l 

U x 5x+3£,x' + ^x-3fi,x') + m S x ,x' 

= ~2 D ^%' + mS x : x>- (3) 

Unlike M wc , the matrix M IS consists solely of a deriva- 
tive term D IS and the mass term. There are two gauge 
fields present: U x is the fat gauge field, and is the field 
produced from locally averaging U x ; and U x is the long 
gauge field produced by taking the product of the links 
UxU^ + j 1 U^ +2 j 1 - While both of these fields are functions 
of the original field U x , in practice, these fields are pre- 
calculated before the application of M* s x , since iterative 
solvers will require the application of M^. s x , many hundreds 
or thousands of times. Since there are no separate spin de- 
grees of freedom at each site, this matrix acts on a vector 
of complex-valued 3-component colors; however, for conve- 
nience we nevertheless refer to the complex lattice vector as 
a spinor field. 

3. ITERATIVE SOLVERS 
3.1 Krylov solvers 

For both discretizations under consideration, M is a large 
sparse matrix, and iterative Krylov solvers are typically used 
to obtain solutions to Equation ([!]), requiring many repeated 
evaluations of the sparse matrix- vector product. The Wilson- 
clover matrix is non- Hermitian, so either Conjugate Gradi- 
ents [9] on the normal equations (CGNE or CGNR) is used, 
or more commonly, the system is solved directly using a 
non-symmetric method, e.g., BiCGstab [To]. Even-odd (also 
known as red-black) preconditioning is almost always used 
to accelerate the solution finding process for this system, 
where the nearest neighbor property of the D wc matrix is 
exploited to solve the Schur complement system [11| . 

The staggered fermion matrix is anti-Hermitian, and has 
the convenient property that when multiplied by its Her- 
mitian conjugate, the even and odd lattices are decoupled 
and can be solved independently from each other. There are 
no commonly used preconditioners for the staggered matrix. 
When simulating asqtad fermions, for both the gauge field 
generation and for the analysis stages, one is confronted with 
solving problems of the form 

(M 1 M + a i I)x l = b i = l... N (4) 

where 0% is a constant scalar and / is the identity matrix. 
This is equivalent to solving N different linear systems at 
different mass parameters for a constant source b. Since the 
generated Krylov spaces are the same for each of these linear 
systems, one can use a multi-shift solver (also known as a 
multi-mass solver) to produce all N solutions simultaneously 
in the same number of iterations as the smallest shift (least 
well conditioned) [12| . 

In both cases, the quark mass controls the condition num- 
ber of the matrix, and hence the convergence of such iter- 



ative solvers. Unfortunately, physical quark masses corre- 
spond to nearly indefinite matrices. Given that current lat- 
tice volumes are at least 10 s degrees of freedom in total, this 
represents an extremely computationally demanding task. 
For both the gauge generation and analysis stages, the lin- 
ear solver accounts for 80-99% of the execution time. 

3.2 Additive Schwarz preconditioner 

As one scales lattice calculations to large core counts, 
on leadership-class partitions, one is faced with a strong- 
scaling challenge: as core counts increase, for a fixed global 
lattice volume the local sub-volume per core decreases and 
the surface-to- volume ratio of the local sub-lattice increases. 
Hence, the ratio of communication to local computation also 
grows. For sufficiently many cores, it becomes impossible 
to hide communication by local computation and the prob- 
lem becomes communications bound, at the mercy of the 
system interconnect. This fact, with the need for periodic 
global reduction operations in the Krylov solvers, leads to 
a slowdown of their performance in the large scaling limit. 
For GPU-based clusters, where inter-GPU communication 
is gated by the PCFE bus, this limitation is substantially 
more pronounced and can occur in partitions as small as 
O(10) CPUs or less. 

This challenge of strong scaling of traditional Krylov solvers 
motivates the use of solvers which minimize the amount 
of communication. Such solvers are commonly known as 
domain-decomposition solvers and two forms of them are 
commonly used: multiplicative Schwarz and additive Schwarz 
processes [13]. In this work we focus upon the additive 
Schwarz method. Here, the entire domain is partitioned 
into blocks which may or may not overlap. The system ma- 
trix is then solved within these blocks, imposing Dirichlet 
(zero) boundary conditions at the block boundaries. The 
imposition of Dirichlet boundary conditions means that no 
communication is required between the blocks, and that each 
block can be solved independently. It is therefore typical to 
assign the blocks to match the sub-domain assigned to each 
processor in a parallel decomposition of a domain. A tun- 
able parameter in these solvers is the degree of overlap of 
the blocks, with a greater degree of overlap corresponding 
to increasing the size of the blocks, and hence the amount 
of computation required to solve each block. A larger over- 
lap will typically lead to requiring fewer iterations to reach 
convergence, since, heuristically, the larger sub blocks, will 
approximate better the original matrix and hence their in- 
verses will form better preconditioners. Note that an addi- 
tive Schwarz solver with non-overlapping blocks is equivalent 
to a block- Jacobi solver. 

Typically, Schwarz solvers are not used as a standalone 
solver, but rather they are employed as preconditioners for 
an outer Krylov method. Since each local system matrix is 
usually solved using an iterative solver, this requires that 
the outer solver be a flexible solver. Generalized conjugate 
residual (GCR) is such a solver, and we shall employ it for 
the work in this paper. 

4. OVERVIEW OF RELATED WORK 

Lattice QCD calculations on GPUs were originally re- 
ported in 
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where the immaturity of using GPUs for gen- 
eral purpose computation necessitated the use of graph- 
ics APIs. Since the advent of CUDA in 2007, there has 
been rapid uptake by the LQCD community (see [15] for 



an overview). More recent work includes [16] , which tar- 
gets the computation of multiple systems of equations with 
Wilson fermions where the systems of equations are related 
by a linear shift. Solving such systems is of great utility 
in implementing the overlap formulation of QCD. This is 
a problem we target in the staggered-fermion solver below. 
The work in [17] targets the domain-wall fermion formula- 
tion of LQCD. This work concerns the QUDA library [3], 
which we describe in Sec. [5] below. 

Most work to date has concerned single-GPU LQCD im- 
plementations, and beyond the multi-GPU parallelization of 
QUDA [4] [6] and the work in [18] which targets a multi-GPU 
implementation of the overlap formulation, there has been 
little reported in the literature, though we are aware of other 
implementations which are in production [19] . 

Domain-decomposition algorithms were first introduced 
to LQCD in [20], through an implementation of the Schwarz 
Alternating Procedure preconditioner, which is a multiplica- 
tive Schwarz preconditioner. More akin to the work pre- 
sented here is the work in [2l] where a restricted additive 
Schwarz preconditioner was implemented for a GPU clus- 
ter. However, the work reported in [2l] was carried out on a 
rather small cluster containing only 4 nodes and connected 
with Gigabit Ethernet. The work presented here aims for 
scaling to O(100) GPUs using a QDR Infiniband intercon- 
nect. 

5. QUDA 

The QUDA library is a package of optimized CUDA ker- 
nels and wrapper code for the most time-consuming com- 
ponents of an LQCD computation. It has been designed to 
be easy to interface to existing code bases, and in this work 
we exploit this interface to use the popular LQCD applica- 
tions Chroma and MILC. The QUDA library has attracted 
a diverse developer community of late and is being used in 
production at LLNL, Jlab and other U.S. national laborato- 
ries, as well as in Europe. The latest development version is 
always available in a publically-accessible source code repos- 
itory [22] . 

QUDA implements optimized linear solvers, which when 
running on a single GPU achieve up to 24% of the CPU's 
peak performance through aggressive optimization. The 
general strategy is to assign a single GPU thread to each 
lattice site, each thread is then responsible for all memory 
traffic and operations required to update that site on the 
lattice given the stencil operator. Maximum memory band- 
width is obtained by reordering the spinor and gauge fields 
to achieve memory coalescing using structures of float2 or 
float4 arrays, and using the texture cache where appropri- 
ate. Memory traffic reduction is employed where possible to 
overcome the relatively low arithmetic intensity of the Dirac 
matrix-vector operations, which would otherwise limit per- 
formance. Strategies include: (a) using compression for the 
SU (3) gauge matrices to reduce the 18 real numbers to 12 
(or 8) real numbers at the expense of extra computation; 
(b) using similarity transforms to increase the sparsity of 
the Dirac matrices; (c) utilizing a custom 16-bit fixed-point 
storage format (here on referred to as half precision) to- 
gether with mixed-precision linear solvers to achieve high 
speed with no loss in accuracy. 

Other important computational kernels provided in the 
library include gauge field smearing routines for constructing 
the fat gauge field used in the asqtad variant of the improved 



staggered discretization, as well as force term computations 
required for gauge field generation. 

The extension of QUDA to support multiple GPUs was 
reported in [I], where both strong and weak scaling was per- 
formed on up to 32 GPUs using a lattice volume of 32 3 x 256 
with Wilson-clover fermions. This employed partitioning of 
the lattice along the time dimension only, and was moti- 
vated by expediency and the highly asymmetric nature of 
the lattices being studied. While this strategy was sufficient 
to achieve excellent (artificial) weak scaling performance, 
it severely limits the strong scaling achievable for realistic 
volumes because of the increase in surface-to-volume ratio. 
The application of these strategies to the improved stag- 
gered discretization was described in [6], where strong scal- 
ing was achieved on up to 8 GPUs using a lattice volume 
of 24 3 x 96. Here the single dimensional parallelization em- 
ployed restricts scaling more severely than for Wilson-clover 
because of the 3-hop stencil of the improved staggered op- 
erator which decreases the locality of the operator. 



Spinor field layout in host memory: 
one spinor field 



Spinor field layout in GPU memory: 
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6. MULTI-DIMENSIONAL PARTITIONING 
6.1 General strategy 

Because lattice discretizations of the Dirac operator gener- 
ally only couple sites of the lattice that are nearby in space- 
time, the first step in any parallelization strategy is to par- 
tition the lattice. As indicated above, prior to this work 
multi-GPU parallelization of the QUDA library had been 
carried out with the lattice partitioned along only a single di- 
mension. The time (T) dimension was chosen, first because 
typical lattice volumes are asymmetric with T longest, and 
secondly because this dimension corresponds to the slowest 
varying index in memory in our implementation, making it 
possible to transfer the boundary face from GPU to host 
with a straightforward series of memory copies. Going be- 
yond this approach requires much more general handling of 
data movement, computation, and synchronization, as we 
explore here. 

In the general case, upon partitioning the lattice each 
GPU is assigned a 4-dimensional subvolume that is bounded 
by at most eight 3-dimensional "faces." Updating sites of 
the spinor field on or near the boundary requires data from 
neighboring GPUs. The data received from a given neigh- 
bor is stored in a dedicated buffer on the GPU which we will 
refer to as a "ghost zone" (since it shadows data residing on 
the neighbor). Computational kernels are modified so as to 
be aware of the partitioning and read data from the appro- 
priate location — either from the array corresponding to the 
local subvolume or one of the ghost zones. Significant atten- 
tion is paid to maintaining memory coalescing and avoiding 
thread divergence, as detailed below. 

Ghost zones for the spinor field are placed in memory after 
the local spinor field so that BLAS-like routines, including 
global reductions, may be carried out efficently. While the 
ghost spinor data for the T dimension is contiguous and can 
be copied without a gather operation, the ghost spinor data 
for the other three dimensions must be collected into con- 
tiguous GPU memory buffers by a GPU kernel before it can 
be transfered to host memory. The ghost zone buffers are 
then exchanged between neighboring GPUs (possibly resid- 
ing in different nodes). Once inter-GPU communication is 
complete, the ghost zones are copied to locations adjoining 
the local array in GPU memory. Allocation of ghost zones 



Figure 2: Spinor field layout in host and GPU mem- 
ory for the staggered discretization (consisting of 6 
floating point numbers per site). Here Vh is half 
the local volume of the lattice, corresponding to the 
number of sites in an even/odd subset. Layout for 
the Wilson-clover discretization is similar, wherein 
the spinor field consists of 24 floating point numbers 
per site. 

and data exchange in a given dimension only takes place 
when that dimension is partitioned, so as to ensure that 
GPU memory as well as PCI-E and interconnect bandwidth 
are not wasted. Layout of the local spinor field, ghost zones, 
and padding regions are shown in Fig. [2] The padding region 
is of adjustable length and serves to reduce partition camp- 
ing [U [23] on previous-generation NVIDIA GPUsQ The 
gauge field is allocated with a similar padding region, and 
we use this space to store ghost zones for the gauge field, 
which must only be transfered once at the beginning of a 
solve. The layout is illustrated in Fig. [3] 

For communication, our implementation is capable of em- 
ploying either of two message-passing frameworks - MPI 
or QMP. The latter "QCD message-passing" standard was 
originally developed to provide a simplified subset of com- 
munication primitives most used by LQCD codes, allowing 
for optimized implementations on a variety of architectures, 
including purpose-built machines that lack MPI. Here we 
rely on the reference implementation, which serves as a thin 
layer over MPI itself (but nevertheless serves a purpose as 
the communications interface used natively by Chroma, for 
example). Accordingly, performance with the two frame- 
works is virtually identical. At present, we assign GPUs to 
separate processes which communicate via message-passing. 
Exploration of peer-to-peer memory copies, recently added 
in CUDA 4.0, and host-side multi-threading is underway. 

6.2 Interior and exterior kernels 

In [4j [6] , where only the time dimension of the lattice was 

2 This is less a concern for the Tesla M2050 cards used in 
this study, as the Fermi memory controller employs address 
hashing to alleviate this problem. 



Gauge field layout in host memory: 
one gauge field 



Gauge field layout in GPU memory: 
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Figure 3: Gauge field layout in host and GPU mem- 
ory. The gauge field consists of 18 floating point 
numbers per site (when no reconstruction is em- 
ployed) and is ordered on the GPU so as to en- 
sure that memory accesses in both interior and 
boundary-update kernels are coalesced to the extent 
possible. 



partitioned, we separated the application of the Dirac opera- 
tor into two kernels, one to update sites on the boundaries of 
the local sublattice (the exterior kernel) and one to perform 
all remaining work (the interior kernel). Here we extend 
this approach by introducing one exterior kernel for every 
dimension partitioned, giving a total of four exterior ker- 
nels in the most general case. The interior kernel executes 
first and computes the spinors interior to the subvolume, as 
well as any contributions to spinors on the boundaries that 
do not require data from the ghost zones. For example, if 
a spinor is located only on the T+ boundary, the interior 
kernel computes contributions to this spinor from all spatial 
dimensions, as well as that of the negative T direction. The 
contribution from the positive T direction will be computed 
in the T exterior kernel using the ghost spinor and gauge 
field from the T+ neighbor. Since spinors on the corners 
belong to multiple boundaries, they receive contributions 
from multiple exterior kernels. This introduces a data de- 
pendency between exterior kernels, which must therefore be 
executed sequentially. 

Another consideration is the ordering used for assigning 
threads to sites. For the interior kernel and T exterior ker- 
nel, the one-dimensional thread index (given in CUDA C 
by (blockIdx.x*blockDim.x + threadldx.x)) is assigned 
to sites of the four-dimensional sublattice in the same way 
that the spinor and gauge field data is ordered in memory, 
with X being the fastest varying index and T the slowest. It 
is thus guaranteed that all spinor and gauge field accesses are 
coalesced. In the X,Y,Z exterior kernels, however, only the 
destination spinors are indexed in this way, while the ghost 
spinor and gauge field are indexed according to a different 
mapping. This makes it impossible to guarantee coalescing 
for both reads and writes; one must choose one order or the 
other for assigning the thread index. We choose to employ 



the standard T-slowest mapping for the X,Y,Z exterior ker- 
nels to minimize the penalty of uncoalesced accesses, since 
a greater fraction of the data traffic comes from the gauge 
field and source spinors. 

6.3 Computation, communication, and streams 

Our implementation employs CUDA streams to overlap 
computation with communication, as well as to overlap GPU- 
to-host with inter-node communication. Two streams per di- 
mension are used, one for gathering and exchanging spinors 
in the forward direction and the other in the backward direc- 
tion. One additional stream is used for executing the interior 
and exterior kernels, giving a total of 9 streams as shown in 
Fig. [4] The gather kernels for all dimensions are launched 
on the GPU immediately so that communication in all di- 
rections can begin. The interior kernel is executed after all 
gather kernels finish, overlapping completely with the com- 
munication. We use different streams for different dimen- 
sions so that the different communication components can 
overlap with each other, including the device-to-host mem- 
ory copy, the copy from pinned host memory to pagable 
host memory, the MPI send and receive, the memory copy 
from pagable memory to pinned memory on the receiving 
side, and the final host-to-device memory copy. While the 
interior kernel can be overlapped with communications, the 
exterior kernels must wait for arrival of the ghost data. As a 
result, the interior kernel and subsequent exterior kernels are 
placed in the same stream, and each exterior kernel blocks 
waiting for communication in the corresponding dimension 
to finish. For small subvolumes, the total communication 
time over all dimensions is likely to exceed the interior ker- 
nel run time, resulting in some interval when the GPU is 
idle (see Fig. |4| and thus degrading overall performance. 
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Figure 4: Usage of CUDA streams in the applica- 
tion of the Dirac operator, illustrating the multiple 
stages of communication. A single stream is used for 
the interior and exterior kernels, and two streams 
per dimension are used for gather kernels, PCI-E 
data transfer, host memory copies, and inter-node 
communication. 

When communicating over multiple dimensions with small 



subvolumes, the communication cost dominates over compu- 
tation, and so any reduction in the communication is likely 
to improve performance. The two host memory copies are 
required due to the fact that GPU pinned memory is not 
compatible with memory pinned by MPI implementations; 
GPU-Direct [24] was not readily available on the cluster used 
in this study. We expect to be able to remove these ex- 
tra memory copies in the future when better support from 
GPU and MPI vendors is forthcoming. CUDA 4.0, recently 
released, includes a promising GPU-to-GPU direct commu- 
nication feature that we will explore in the future to further 
reduce the communication cost. 

7. DIRAC OPERATOR PERFORMANCE 

7.1 Hardware description 

For the numerical experiments discussed in this paper we 
used the Edge visualization cluster installed at Lawrence 
Livermore National Laboratory. Edge is comprised of a total 
of 216 nodes, of which 206 are compute nodes available for 
batch jobs. Each compute node is comprised of dual-socket 
six-core Intel X5660 Westmere CPUs running at 2.8GHz and 
two NVIDIA Tesla M2050 CPUs, running with error cor- 
rection (ECC) enabled. The two GPUs share a single xl6 
PCI-E connection to the I/O hub (IOH) via a switch. Eight 
of the remaining PCI-E lanes serve a quad data rate (QDR) 
InfiniBand interface which can thus run at full bandwidth. 
The compute nodes run a locally maintained derivative of a 
CentOS 5 kernel with revision 2 . 6 . 18-chaosl03. 

To build and run our software we used OpenMPI version 
1.5 built on top of the system GNU C/C++ compiler version 
4.1.2. To build and link against the QUDA library we used 
release candidate 1 of CUDA version 4.0 with driver version 
270.27. 

7.2 Wilson-clover 




operator on up to 256 GPUs. We see significant departures 
from ideal scaling for more than 32 GPUs, as increasing 
the surface-to-volume ratio increases the amount of time 
spent in communication, versus computation. It seems that, 
for more than 32 GPUs, we are no longer able to suffi- 
ciently overlap computation with communication, and the 
implementation becomes communications bound. We note 
also that as the communications overhead grows, the per- 
formance advantage of the half precision operator over the 
single precision operator appears diminished. The severity of 
the scaling violations seen here highlights the imbalance be- 
tween the communications and compute capability of GPU 
clusters. To overcome this constraint, algorithms which re- 
duce communication, such as the domain-decomposition al- 
gorithms described below, are absolutely essential. 

7.3 Improved staggered 

In Fig.[6j we plot the performance per GPU for the asqtad 
volume used in this study. A number of interesting obser- 
vations can be made about this plot. At a relatively low 
number of GPUs, where we are less communications-bound, 
having faster kernel performance is more important than the 
optimal surface-to-volume ratio. As the number of GPUs is 
increased, the minimization of the surface-to-volume ratio 
becomes increasingly important, and the XYZT partition- 
ing scheme, which has the worst single-GPU performance, 
obtains the best performance on 256 GPUs. 
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Figure 6: Strong-scaling performance for the asq- 
tad operator in double (DP) and single (SP) preci- 
sion. The legend labels denote which dimensions are 
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Figure 5: Strong-scaling results for the Wilson- 
clover operator in single (SP) and half (HP) pre- 
cisions (V = 32 3 x 256, 12 gauge reconstruction). 

In Fig. [5] we show the strong scaling of the Wilson-clover 



8. BUILDING SCALABLE SOLVERS 

8.1 Wilson-clover additive Schwarz precondi- 
tioner 

The poor scaling of the application of the Wilson-clover 
matrix at a high number of GPUs at this volume motivates 



the use of communication-reducing algorithms. In this ini- 
tial work, we investigate using a non-overlapping additive 
Schwarz preconditioner for GCR. In the text that follows, 
we refer to this algorithm as GCR-DD. 

Implementation of the preconditioner is simple: essen- 
tially, we just have to switch off the communications between 
GPUs. This means that in applying the Dirac operator, the 
sites that lie along the communication boundaries only re- 
ceive contributions from sites local to that node. Addition- 
ally, since the solution in each domain is independent from 
every other, the reductions required in each of the domain- 
specific linear solvers are restricted to that domain only. As 
a result the solution of the preconditioner linear system will 
operate at similar efficiency to the equivalent single-GPU 
performance at this local volume. The imposition of the 
Dirichlet boundary conditions upon the local lattice leads 
to a vastly reduced condition number. This, coupled with 
the fact that only a very loose approximation of the local 
system matrix is required, means that only a small number 
of steps of minimum residual (MR) are required to achieve 
satisfactory accuracy. 

The outer GCR solver builds a Krylov subspace, within 
which the residual is minimized and the corresponding solu- 
tion estimate is obtained. Unlike solvers with explicit recur- 
rence relations, e.g., CG, the GCR algorithm for the general 
matrix problem requires explicit orthogonalization at each 
step with respect to the previously generated Krylov space. 
Thus, the size of the Krylov space is limited by the compu- 
tational and memory costs of orthogonalization. After the 
limit of the Krylov space has been reached, the algorithm is 
restarted, and the Krylov space is rebuilt. 

Similar to what was reported for BiCGstab [5], we have 
found that using mixed precision provides significant accel- 
eration of the GCR-DD solver algorithm. We exclusively use 
half precision for solving the preconditioned system. This is 
natural since only a very loose approximation is required. 
Additionally, the restarting mechanism of GCR provides a 
natural opportunity for using mixed precision: the Krylov 
space is built up in low precision and restarted in high pre- 
cision. This approach also conserves the limited GPU mem- 
ory, allowing for larger Krylov spaces to be built, albeit 
in lower precision. We follow the implicit solution update 
scheme described in 
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since this reduces the orthogonal- 
ization overhead. For the physics of interest, the inherent 
noise present in the Monte Carlo gauge generation process 
is such that single-precision accuracy is sufficient. Thus we 
have found best performance using a single-half-half solver, 
where the GCR restarting is performed in single precision, 
the Krylov space construction and accompanying orthogo- 
nalization is done in half precision, and the preconditioner 
is solved in half precision. In minimizing the residual in half 
precision, there is the inherent risk of the iterated resid- 
ual straying too far from the true residual. Thus, we have 
added an early termination criteria for the Krylov subspace 
generation, where if the residual is decreased by more than 
a given tolerance 8 from the start of the Krylov subspace 
generation, the algorithm is restarted. The mixed-precision 
GCR-DD solver is illustrated in Algorithm [I] 

8.2 Improved staggered 

While using a multi-shift CG solver can lead to significant 
speedups on traditional architectures, their use is less clear- 
cut on GPUs: multi-shift solvers cannot be restarted, mean- 
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z k = M p k 
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end 



end 



Algorithm 1: Mixed-precision GCR-DD solver. Low- 
precision fields are indicated with a hat ("); e.g., x and x, 
correspond to high- and low-precision respectively. The 
domain-decomposed preconditioner matrix is denoted by 
K, and the desired solver tolerance is given by tol. The 
parameter k max denotes the maximum size of the Krylov 
subspace. 



ing that using a mixed-precision strategy is not possible; 
the extra BLASl-type linear algebra incurred is extremely 
bandwidth intensive and so can reduce performance signif- 
icantly; multi-shift solvers come with much larger memory 
requirements since one has to keep both the N solution and 
direction vectors in memory. 

With these restrictions in mind, we have employed a mod- 
ified multi-shift solver strategy where we solve Equation Q 
using a pure single-precision multi-shift CG solver and then 
use mixed-precision sequential CG, refining each of the Xi so- 
lution vectors until the desired tolerance has been reached (3 
This allows us to perform most of the operations in single- 
precision arithmetic while still achieving double-precision ac- 
curacy. Since double precision is not introduced until after 
the multi-shift solver is completed, the memory requirements 
are much lower than if a pure double-precision multi-shift 
solver were employed, allowing the problem to be solved on 



Unfortunately, such an algorithm is not amenable to the 
use of half precision since the solutions produced from the 
initial multi-shift solver would be too inaccurate, demand- 
ing a large degree of correction in the subsequent sequential 
refinements. 



a smaller number of GPUs. When compared to doing just 
sequential mixed-precision CG, the sustained performance 
measured in flops is significantly lower because of the in- 
creased linear algebra; however, the time to solution is sig- 
nificantly shorter. 

9. SOLVER PERFORMANCE RESULTS 
9.1 Wilson-clover 

Our Wilson-Clover solver benchmarks were run with the 
QUDA library being driven by the Chroma 25 code. The 
solves were performed on a lattice of volume 32^ x 256 lattice 
sites from a recent large scale production run, spanning sev- 
eral facilities including Cray machines at NICS and OLCF, 
as well as BlueGene/L facilities at LLNL and a BlueGene/P 
facility at Argonne National Laboratory (ANL). The quark 
mass used in the generation of the configuration corresponds 
to a pion mass of ~230 MeV in physical units 26 . 
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Figure 7: Sustained strong-scaling performance 
in Gfiops of the Wilson-clover mixed precision 
BiCGstab and GCR-DD solvers (V = 32 s x 256, 10 
steps of MR used to evaluate the preconditioner). 

In Fig. [7] we plot the sustained performance in Tflops of 
both the BiCGstab and GCR-DD solvers. For the BiCGstab 
solver, we can see that despite the multi-dimensional paral- 
lelization, we are unable to effectively scale past 32 GPUs 
because of the increased surface-to- volume ratio. The GCR- 
DD solver does not suffer from such problems and scales to 
256 GPUs. As described above, the raw flop count is not 
a good metric of actual speed since the the iteration count 
is a function of the local block size. In Fig. [8] we compare 
the actual time to solution between the two solvers. While 
at 32 GPUs BiCGstab is a superior solver, past this point 
GCR-DD exhibits significantly reduced time to solution, im- 
proving performance over BiCGstab by 1.52x, 1.63x, and 
1.64x at 64, 128, and 256 GPUs respectively. Despite the 
improvement in scaling, we see that at 256 GPUs we have 
reached the limit of this algorithm. While we have vastly 
reduced communication overhead by switching to GCR-DD, 
there is still a significant fraction of the computation that 



requires full communication. This causes an Amdahl's law 
effect to come into play, which is demonstrated by the fact 
that the slope of the slow down for GCR and BiCGstab is 
identical in moving from 128 to 256 GPUs. Additionally, 
we note that if we perform a single-GPU run with the same 
per-GPU volume as considered here for 256 GPUs, perfor- 
mance is almost a factor of two slower than that for a run 
corresponding to 16 GPUs. Presumably this is due to the 
GPU not being completely saturated at this small problem 
size. 
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Figure 8: Sustained strong-scaling time to solu- 
tion in seconds of the Wilson-clover mixed precision 
BiCGstab and GCR-DD solvers (V = 32 3 x 256, 10 
steps of MR used to evaluate the preconditioner) . 



In terms of raw performance, it can be seen that the GCR- 
DD solver achieves greater than 10 Tflops on partitions of 
128 GPUs and above. Thinking more conservatively, one can 
use the improvement factors in the time to solution between 
BiCGStab and GCR to assign an "effective BiCGStab per- 
formance" number to the GCR solves. On 128 GPUs GCR 
performs as if it were BiCGStab running at 9.95 Tflops, 
whereas on 256 GPUs it is as if it were BiCGStab running 
at 11.5 Tflops. To put the performance results reported here 
into perspective, in Fig. [9] we show a strong-scaling bench- 
mark from a variety of leadership computing systems on a 
lattice of the same volume as used here. Results are shown 
for the Jaguar Cray XT4 (recently retired) and Jaguar PF 
Cray XT5 systems at OLCF, as well as the Intrepid Blue- 
Gene/P facility at ANL. The performance range of 10-17 
Tflops is attained on partitions of size greater than 16,384 
cores on all these systems. Hence, we believe it is fair to 
say that the results obtained in this work are on par with 
capability-class systems. 

9.2 Improved staggered 

The results for improved staggered fermions were obtained 
using the QUDA library driven by the publicly available 

The 64 3 x 
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MIMD Lattice Collaboration (MILC) code 
192 gauge fields used for this study corresponds to a pion 
mass of ~320 MeV in physical units [281. 




Figure 9: Strong-scaling benchmarks on a lattice of 
size 32 3 x 256 from Cray XT4 (Jaguar), Cray XT5 
(JaguarPF) and BlueGene/P (Intrepid). Solves 
were done to double-precision accuracy. The Cray 
solvers used mixed (double-single) precision; the 
BG/P solver used pure double precision. 

In Fig. |10| we plot the performance of the mixed-precision 
multi-shift CG algorithm. When running the full solver, 
the minimum number of GPUs that can accommodate the 
task is 64. Reasonable strong scaling is observed, where we 
achieve a speed-up of 2.56x in moving from 64 to 256 GPUs. 

With 256 GPUs, we achieve 5.49 Tflops with double-single 
mixed precision. In Ref. [6], we observed an approximately 
20% increase in iteration count for the mixed precision solver, 
comparing to the pure double precision one. To put this in 
perspective, the CPU version of MILC running on Kraken, a 
Cray XT5 system housed at NICS, achieves 942 Gfiops with 
4096 CPU cores for the double precision multi-shift solver. 
This means one GPU computes approximately as fast as 74 
CPU cores in large-scale runs. 

10. CONCLUSIONS 

Our main result is to demonstrate that by the use of multi- 
dimensionsional parallelization and an additive Schwarz pre- 
conditioner, the Wilson-clover solver for lattice QCD can 
be successfully strong-scaled to over 100 GPUs. This is 
a significant achievement demonstrating that GPU clusters 
are capable of delivering in excess of 10 teraflops of perfor- 
mance, a minimal capability required to apply GPU clusters 
to the generation of lattice ensembles. Additionally, multi- 
dimensional parallelization has enabled the use of GPUs for 
asqtad solvers at leading-edge lattice volumes, a feat which 
was previously not possible because of the decreased locality 
of the asqtad operator. 

Clearly, the present use of a simple non-overlapping ad- 
ditive Schwarz preconditioner is only a first step. It is very 
likely that more sophisticated methods with overlapping do- 
mains or multiple levels of Schwarz-type blocking to take 
advantage of the multiple levels of memory locality that a 
GPU cluster offers can be devised to improve the scaling 



Figure 10: Sustained strong-scaling performance 
in Gfiops of the asqtad mixed-precision multi-shift 
solver. The legend labels denote which dimensions 
are partitioned between GPUs (V = 64 3 x 192). 

substantially. Moreover, we view GPUs and the use of the 
Schwarz preconditioner as parts of a larger restructuring of 
algorithms and software to address the inevitable future of 
heterogeneous architectures with deep memory hierarchies. 
We anticipate that the arsenal of tools needed for the future 
of lattice QCD and similarly structured problems (e.g., fi- 
nite difference problems, material simulations, etc.) at the 
exascale will include domain decomposition, mixed-precision 
solvers and data compression/recomputation strategies. 

11. ACKNOWLEDGMENTS 

We gratefully acknowledge the use of the Edge GPU clus- 
ter at Lawrence Livermore National Laboratory. This work 
was supported in part by NSF grants OCI-0946441, OCI- 
1060012, OCI-1060067, and PHY-0555234, as well as DOE 
grants DE-FC02-06ER41439, DE-FC02-06ER41440, DE-FC02- 
06ER41443, DE-FG02-91ER40661, and DE-FG02-91ER40676. 
BJ additionally acknowledges support under DOE grant DE- 
AC05-06OR23177, under which Jefferson Science Associates 
LLC manages and operates Jefferson Lab. GS is funded 
through the Institute for Advanced Computing Applications 
and Technologies (IACAT) at the University of Illinois at 
Urbana- Champaign. The U.S. Government retains a non- 
exclusive, paid-up, irrevocable, world-wide license to publish 
or reproduce this manuscript for U.S. Government purposes. 

12. REFERENCES 

[1] http : //en . wikipedia . org/ wiki/Grand_Challenge 
2011. 

[2] http : / /www . research . ibm . com/bluegene/BG_ 
External_Presentation_January_2002 .pdf , 2002. 

[3] M. A. Clark, R. Babich, K. Barros, R. C. Brower, and 
C. Rebbi, "Solving Lattice QCD systems of equations 
using mixed precision solvers on GPUs " \Comput. \ 
Phys. Commun. 181 (2010) 1517-1528| 
arXiv: 091 1.3191 [hep-lat]| 



[4] R. Babich, M. A. Clark, and B. Joo, "Parallelizing the 
QUDA Library for Multi-GPU Calculations in Lattice 
Quantum Chromodynamics," in Proceedings of the 
2010 ACM /IEEE International Conference for High 
Performance Computing, Networking, Storage and 
Analysis, SC '10, pp. 1-11. IEEE Computer Society, 
Washington, DC, USA, 2010. |arXiv : 1011 . 0024~ 
| [hep-lat] | 

[5] S. Gottlieb, G. Shi, A. Torok, and V. Kindratenko, 
"QUDA programming for staggered quarks," PoS 
LATTICE2010 (2010) 026. 

[6] G. Shi, S. Gottlieb, A. Torok, and V. V. Kindratenko, 
"Design of MILC lattice QCD application for GPU 
clusters," in IPDPS. IEEE, 2011. 

[7] B. Sheikholeslami and R. Wohlert, "Improved 

Continuum Limit Lattice Action for QCD with Wilson 
Fermions," \Nucl. Phys. B259 (1985) 572| 

[8] A. Bazavov, D. Toussaint, C. Bernard, J. Laiho, 

C. DeTar, L. Levkova, M. B. Oktay, S. Gottlieb, U. M. 
Heller, J. E. Hetrick, P. B. Mackenzie, R. Sugar, and 
R. S. Van de Water, "Nonperturbative QCD 
simulations with 2 + 1 flavors of improved staggered 
quarks," \Rev. Mod. Phys. 82 no. 2, (May, 2010) 
11349-14171 

[9] M. R. Hestenes and E. Stiefel, "Methods of Conjugate 
Gradients for Solving Linear Systems," Journal of 
Research of the National Bureau of Standards 49 
no. 6, (Dec, 1952) 409-436. 

[10] H. A. van der Vorst, "Bi-CGSTAB: A Fast and 
Smoothly Converging Variant of Bi-CG for the 
Solution of Nonsymmetric Linear Systems," SIAM 
Journal on Scientific and Statistical Computing 13 
no. 2, (1992) 631-644. 

[11] T. A. Degrand and P. Rossi, "Conditioning techniques 
for dynamical fermions," Computer Physics 
\Communications 60 no. 2, (1990) 211 - 214] 

[12] B. Jegerlehner, "Krylov space solvers for shifted linear 
systems," arXiv:hep-lat/9612014 

[13] H. A. Schwarz, "Uber einen Grenziibergang durch 
alternierendes Verfahren," Vierteljahrsschrift der 
Naturforschenden Gesellschaft in Zurich 15 (1870) 
272-286. 

[14] G. I. Egri, Z. Fodor, C. Hoelbling, S. D. Katz, 

D. Nogradi, and K. K. Szabo, "Lattice QCD as a video 
game," Computer Physics Communications 177 no. 8, 
|(2007) 631 - 639||arXiv: 061 1022 [hep-lat] | 

[15] M. A. Clark, "QCD on CPUs: cost effective 

supercomputing," PoS LATTICE2009 (2009) 003. 

[16] A. Alexandra, C. Pelissier, B. Gamari, and F. Lee, 
"Multi-mass solvers for lattice QCD on GPUs," 
|arXiv: 1103. 5103 [hep-latT| 

[17] TWQCD Collaboration, T.-W. Chiu, T.-H. Hsieh, 
Y.-Y. Mao, and K. Ogawa, "GPU-Based Conjugate 
Gradient Solver for Lattice QCD with Domain- Wall 
Fermions," PoS LATTICE2010 (2010) 030, 
|arXiv: 1101 .0423 [hep-latT| 

[18] A. Alexandra, M. Lujan, C. Pelissier, B. Gamari, and 
F. X. Lee, "Efficient implementation of the overlap 
operator on multi- GPUs," |arXiv: 1106.4964 | 
| [hep-lat] | 

[19] S. Borsani, "Thermodynamics from accelerated 
architectures." 



http : / /crunch. ikp .physik . tu-darmstadt . de/ 
gpu2011/Talks/Borsanyi_Darmstadt_GPU.pdf 2011. 

[20] M. Luscher, "Solution of the Dirac equation in lattice 
QCD using a domain decomposition method," 
\Comput.Phys.Commun. 156 (2004) 209-220] 
|arXiv:hep-lat/0310048 [hep-lat] | 

[21] Y. Osaki and K.-I. Ishikawa, "Domain Decomposition 
method on GPU cluster," PoS LATTICE2010 
(2010) 036, |arXiv: 1011 .3318 [hep-latT| 

[22] http://lattice.github.com/quda, 2011. 

[23] G. Ruetsch and P. Micikevicius, "Optimizing matrix 
transpose in CUDA," NVIDIA Technical Report 
(2009) . 

[24] http : //www .mellanox . com/pdf /whitepapers/TB_ 
GPU_Direct.pdf, 2010. 

[25] R. G. Edwards and B. Joo, "The Chroma software 

system for lattice QCD," Nucl. Phys. Proc. Suppl. 140 
(200 5) 832| |arXiv : hep-lat/ 0409003 

[26] H.-W. Lin et al, "First results from 2+1 dynamical 
quark flavors on an anisotropic lattice: light-hadron 
spectroscopy and setting the strange-quark mass," 
\ Phys. Rev D79 (2009) 034502||arXiv: 0810 .3588 
[hep-ph] 

[27] MIMD Lattic e Collaboration, C. Bernard et al., "The 
MILC Code." |http: ~~| 
//www .physics .Utah. edu/~detar/milc/milcv7 .pdf 
2010. 

[28] A. Bazavov, D. Toussaint, C. Bernard, J. Laiho, 

C. DeTar, et al., "Nonperturbative QCD simulations 
with 2+1 flavors of improved staggered quarks," 
\Rev. Mod. Phys. 82 (2010) 1349^1417] 
arXiv : 0903 . 3598 [hep-lat] 



